AITopics | Audio & Video

Audio & Video

IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos

Neural Information Processing Systems | Mar-24-2025, 06:50:16 GMT

Shape assembly is a ubiquitous task in daily life, integral for constructing complex 3D structures like IKEA furniture. While significant progress has been made in developing autonomous agents for shape assembly, existing datasets have not yet tackled the 4D grounding of assembly instructions in videos, essential for a holistic understanding of assembly in 3D space over time. We introduce IKEA Video Manuals, a dataset that features 3D models of furniture parts, instructional manuals, assembly videos from the Internet, and most importantly, annotations of dense spatio-temporal alignments between these data modalities. To demonstrate the utility of IKEA Video Manuals, we present five applications essential for shape assembly: assembly plan generation, part-conditioned segmentation, part-conditioned pose estimation, video object segmentation, and furniture assembly based on instructional video manuals. For each application, we provide evaluation metrics and baseline methods. Through experiments on our annotated data, we highlight many challenges in grounding assembly instructions in videos to improve shape assembly, including handling occlusions, varying viewpoints, and extended assembly sequences.

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country: Asia (0.14)

Genre:

Instructional Material > Training Manual (0.48)
Research Report > New Finding (0.46)

Industry:

Retail (1.00)
Banking & Finance (0.67)
Education > Educational Technology > Audio & Video (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

VideoGUI: A Benchmark for GUI Automation from Instructional Videos

Kevin Qinghong Lin

Neural Information Processing Systems | Mar-23-2025, 01:51:20 GMT

Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruction, such as "Insert a new slide." In this work, we introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks. Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software (e.g., Adobe Photoshop or Stable Diffusion WebUI) and complex activities (e.g., video editing). VideoGUI evaluates GUI assistants through a hierarchical process, allowing for identification of the specific levels at which they may fail: (i) high-level planning: reconstruct procedural subtasks from visual conditions without language descriptions; (ii) middle-level planning: generate sequences of precise action narrations based on visual state (i.e., screenshot) and goals; (iii) atomic action execution: perform specific actions such as accurately clicking designated elements. For each level, we design evaluation metrics across individual dimensions to provide clear signals, such as individual performance in clicking, dragging, typing, and scrolling for atomic action execution. Our evaluation on VideoGUI reveals that even the SoTA large multimodal model GPT-4o performs poorly on visual-centric GUI tasks, especially for high-level planning.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

Neural Information Processing Systems

Country: Asia (0.14)

Genre:

Research Report (1.00)
Instructional Material > Course Syllabus & Notes (0.61)

Industry:

Education > Educational Technology > Audio & Video (0.71)
Education > Educational Technology > Media (0.61)

Technology:

Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(3 more...)

VideoGUI: A Benchmark for GUI Automation from Instructional Videos

Neural Information Processing Systems | Mar-17-2025, 03:45:15 GMT

Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruction, such as "Insert a new slide." In this work, we introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks. Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software (e.g., Adobe Photoshop or Stable Diffusion WebUI) and complex activities (e.g., video editing). VideoGUI evaluates GUI assistants through a hierarchical process, allowing for identification of the specific levels at which they may fail: (i) high-level planning: reconstruct procedural subtasks from visual conditions without language descriptions; (ii) middle-level planning: generate sequences of precise action narrations based on visual state (i.e., screenshot) and goals; (iii) atomic action execution: perform specific actions such as accurately clicking designated elements.

artificial intelligence, instructional video, machine learning, (8 more...)

Neural Information Processing Systems

Industry:

Education > Educational Technology > Media (0.64)
Education > Educational Technology > Audio & Video (0.64)

Technology:

Information Technology > Graphics (1.00)
Information Technology > Human Computer Interaction > Interfaces (0.61)
Information Technology > Artificial Intelligence > Machine Learning (0.41)

Do's and Don'ts: Learning Desirable Skills with Instruction Videos

Neural Information Processing Systems | Mar-16-2025, 18:19:17 GMT

Unsupervised skill discovery is a learning paradigm that aims to acquire diverse behaviors without explicit rewards. However, it faces challenges in learning complex behaviors and often leads to learning unsafe or undesirable behaviors. For instance, in various continuous control tasks, current unsupervised skill discovery methods succeed in learning basic locomotion skills like standing but struggle with learning more complex movements such as walking and running. Moreover, they may acquire unsafe behaviors like tripping and rolling or navigate to undesirable locations such as pitfalls or hazardous areas. In response, we present DoDont (Do's and Don'ts), an instruction-based skill discovery algorithm composed of two stages.

artificial intelligence, machine learning, skill discovery algorithm, (4 more...)

Neural Information Processing Systems

Industry: Education > Educational Technology > Audio & Video (0.49)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.42)

COBE: Contextualized Object Embeddings from Narrated Instructional Video Supplementary Materials

Neural Information Processing Systems | Jan-27-2025, 13:40:47 GMT

Our supplementary materials consist of: 1. Implementation Details. We train our model for 10 epochs with an initial learning rate of 0.001, a linear warmup of 500 steps and a momentum of 0.9. We use a multi-scale training approach implemented by resizing the shorter side of the frame randomly between 400 and 800 pixels. Our model is trained in a distributed setting using 64 GPUs, each GPU holding a single frame. We initialize our model with a Faster R-CNN pretrained on COCO for object detection.
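The training settings above can be illustrated with a minimal sketch. It uses only the numbers given in the text (initial learning rate 0.001, 500 warmup steps, shorter side resized randomly between 400 and 800 pixels). The function names `warmup_lr` and `multiscale_size` are hypothetical, and the assumption that warmup ramps linearly from near zero up to the initial rate is ours, since the text does not state the starting value.

```python
import random
from typing import Tuple

BASE_LR = 0.001                 # initial learning rate given in the text
WARMUP_STEPS = 500              # linear warmup length given in the text
MIN_SIDE, MAX_SIDE = 400, 800   # shorter-side range given in the text


def warmup_lr(step: int) -> float:
    """Learning rate at a 0-indexed step.

    Assumption: the rate ramps linearly from BASE_LR / WARMUP_STEPS up to
    BASE_LR over the warmup period, then stays constant.
    """
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    return BASE_LR


def multiscale_size(width: int, height: int, rng: random.Random) -> Tuple[int, int]:
    """Sample a shorter-side length and rescale, preserving aspect ratio."""
    target = rng.randint(MIN_SIDE, MAX_SIDE)  # inclusive on both ends
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


if __name__ == "__main__":
    rng = random.Random(0)
    print(warmup_lr(0), warmup_lr(250), warmup_lr(1000))
    print(multiscale_size(1920, 1080, rng))
```

In a real pipeline the sampled size would be passed to an image-resize routine once per frame, so each training step sees the scene at a different scale.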

artificial intelligence, machine learning, natural language, (11 more...)

Neural Information Processing Systems

Country:

North America > Canada (0.15)
Africa > Ethiopia (0.15)

Genre: Instructional Material > Course Syllabus & Notes (0.41)

Industry:

Education > Educational Technology > Media (0.41)
Education > Educational Technology > Audio & Video (0.41)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Review for NeurIPS paper: COBE: Contextualized Object Embeddings from Narrated Instructional Video

Neural Information Processing Systems | Jan-27-2025, 13:40:42 GMT

While this algorithm is specifically designed for detectors, Miech et al. (2019) used unsupervised NCE losses (much like the ones in this paper) in order to understand the natural language descriptions associated with videos; the algorithm presented here seems like the most straightforward extension of this idea to bounding boxes. Little attention is given to demonstrating that the use of bounding boxes fundamentally changes the problem. Update: The rebuttal addresses the following point regarding the accuracy of the evaluation. I had misunderstood the annotations that are available with EPIC-Kitchens, and therefore I am changing my review. I would encourage the authors to clarify the writing regarding what is available with EPIC-Kitchens.

artificial intelligence, benchmark, contextualized object embedding, (10 more...)

Neural Information Processing Systems

Industry:

Education > Educational Technology > Media (0.40)
Education > Educational Technology > Audio & Video (0.40)

Technology: Information Technology > Artificial Intelligence (0.39)

COBE: Contextualized Object Embeddings from Narrated Instructional Video

Facebook AI

Neural Information Processing Systems | Jan-27-2025, 13:40:42 GMT

Many objects in the real world undergo dramatic variations in visual appearance. For example, a tomato may be red or green, sliced or chopped, fresh or fried, liquid or solid. Training a single detector to accurately recognize tomatoes in all these different states is challenging. On the other hand, contextual cues (e.g., the presence of a knife, a cutting board, a strainer or a pan) are often strongly indicative of how the object appears in the scene. Recognizing such contextual cues is useful not only to improve the accuracy of object detection or to determine the state of the object, but also to understand its functional properties and to infer ongoing or upcoming human-object interactions.

computer vision, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country: North America > United States (0.68)

Genre: Instructional Material > Course Syllabus & Notes (0.42)

Industry:

Education > Educational Technology > Media (0.42)
Education > Educational Technology > Audio & Video (0.42)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Video-Mined Task Graphs for Keystep Recognition in Instructional Videos

Neural Information Processing Systems | Jan-19-2025, 23:38:19 GMT

Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state, such as the steps of a recipe or the steps of a DIY fix-it task. Prior work largely treats keystep recognition in isolation from this broader structure, or else rigidly confines keysteps to align with a particular sequential script. We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps, then leverage this graph to regularize keystep recognition in novel videos. On multiple datasets of real-world instructional video, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.
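One simple way to picture a probabilistic task graph is as a table of keystep-to-keystep transition probabilities. The sketch below is an illustration under our own assumptions, not the paper's method: it assumes each video has already been reduced to an ordered list of keystep labels and estimates first-order transition frequencies. The function name `build_task_graph` and the example recipe steps are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_task_graph(videos: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(next keystep | current keystep) from ordered keystep lists.

    Assumption: a first-order (Markov) transition model stands in for the
    task graph; the source does not specify how its graph is constructed.
    """
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for steps in videos:
        # Count each consecutive (current, next) pair within one video.
        for current, nxt in zip(steps, steps[1:]):
            counts[current][nxt] += 1

    graph: Dict[str, Dict[str, float]] = {}
    for current, successors in counts.items():
        total = sum(successors.values())
        graph[current] = {nxt: c / total for nxt, c in successors.items()}
    return graph


if __name__ == "__main__":
    demo = [
        ["crack egg", "whisk", "pour"],
        ["crack egg", "whisk", "add salt"],
        ["crack egg", "add salt"],
    ]
    for step, successors in build_task_graph(demo).items():
        print(step, successors)
```

A graph of this kind could serve as a prior when labeling a new video: a predicted keystep that rarely follows the previous one would be down-weighted.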

large language model, natural language, video-mined task graph, (3 more...)

Neural Information Processing Systems

Genre: Instructional Material > Course Syllabus & Notes (0.30)

Industry:

Education > Educational Technology > Media (0.66)
Education > Educational Technology > Audio & Video (0.66)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.30)

IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos

Liu, Yunong, Eyzaguirre, Cristobal, Li, Manling, Khanna, Shubh, Niebles, Juan Carlos, Ravi, Vineeth, Mishra, Saumitra, Liu, Weiyu, Wu, Jiajun

arXiv.org Artificial Intelligence | Nov-18-2024

Shape assembly is a ubiquitous task in daily life, integral for constructing complex 3D structures like IKEA furniture. While significant progress has been made in developing autonomous agents for shape assembly, existing datasets have not yet tackled the 4D grounding of assembly instructions in videos, essential for a holistic understanding of assembly in 3D space over time. We introduce IKEA Video Manuals, a dataset that features 3D models of furniture parts, instructional manuals, assembly videos from the Internet, and most importantly, annotations of dense spatio-temporal alignments between these data modalities. To demonstrate the utility of IKEA Video Manuals, we present five applications essential for shape assembly: assembly plan generation, part-conditioned segmentation, part-conditioned pose estimation, video object segmentation, and furniture assembly based on instructional video manuals. For each application, we provide evaluation metrics and baseline methods. Through experiments on our annotated data, we highlight many challenges in grounding assembly instructions in videos to improve shape assembly, including handling occlusions, varying viewpoints, and extended assembly sequences.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2411.11409

Genre:

Instructional Material > Training Manual (0.48)
Research Report > New Finding (0.46)

Industry:

Retail (1.00)
Banking & Finance (0.67)
Education > Educational Technology > Audio & Video (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos

Neural Information Processing Systems | Oct-11-2024, 07:55:50 GMT

We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. To achieve this goal, we propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training. We introduce a divided strategy that alternates between computing inter- and intra-modal attention across the visual and natural language modalities, which allows effective training via directly contrasting the two modalities' representations. We demonstrate the effectiveness of our approach by self-training on the HowTo100M instructional video dataset and evaluating on a newly collected dataset of localized described interactions in the YouCook2 dataset.
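To make the idea of contrasting the two modalities' representations concrete, here is a minimal sketch of a generic InfoNCE-style contrastive loss. It is not taken from the paper: the exact loss, the temperature value, and the function name `info_nce` are assumptions, and the attention network that would produce the embeddings is omitted. Each visual embedding is scored against every narration embedding in the batch, and the loss is low when the matching pair scores highest.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(visual: List[Sequence[float]],
             text: List[Sequence[float]],
             temperature: float = 0.1) -> float:
    """Mean cross-entropy of matching visual[i] to text[i] within a batch.

    Assumption: pairs at the same index are positives and all other
    narrations in the batch act as negatives.
    """
    losses = []
    for i, v in enumerate(visual):
        logits = [dot(v, t) / temperature for t in text]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    video_emb = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce(video_emb, matched))  # small: correct pairs score highest
    print(info_nce(video_emb, swapped))  # large: correct pairs score lowest
```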

artificial intelligence, machine learning, self-supervised spatial grounding, (5 more...)

Neural Information Processing Systems

Genre: Instructional Material > Course Syllabus & Notes (0.65)

Industry:

Education > Educational Technology > Media (0.65)
Education > Educational Technology > Audio & Video (0.65)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.43)
